Speaker recognition

ABSTRACT

This invention relates to an improved method and apparatus for speaker recognition. In this invention, prior to comparing feature vectors derived from speech with a stored reference model, the feature vectors are processed by applying a speaker dependent transform which matches the characteristics of a particular speaker's vocal tract. Features derived from speech which has very dissimilar characteristics to those of the speaker on which the transform is dependent may be severely distorted by the transform, whereas features from speech which has similar characteristics to those of the speaker on which the transform is dependent will be distorted far less.

FIELD OF THE INVENTION

The present invention relates to speaker recognition. In speaker recognition the identity of the speaker is identified or verified. In speaker identification a speaker is identified either as being one of a group of known speakers, or is rejected as being an unknown speaker. In speaker verification the speaker is either accepted as having a claimed identity or rejected. The speaker may input a claimed identity, for example, by means of a password, a personal identification number or a swipe card.

BACKGROUND OF THE INVENTION

In general, for speaker recognition, speech processing aims to emphasise the effects of different speakers on the spoken word, whereas for speech recognition, in which a particular word (or, sometimes, a phrase, a phoneme or other spoken matter) is recognised, speech processing aims to reduce the effects of different speakers on the spoken word.

It is common to input speech data, typically in digital form, to a front-end processor, which derives from the stream of input speech data more compact, more perceptually significant data referred to as input feature vectors (or sometimes as front-end feature vectors). Where the speaker speaks a predetermined word, known to the recognition apparatus and to the speaker (e.g. a personal identification number in banking), the technique is known as 'text-dependent'. In some applications of speaker recognition a technique is used which does not require the content of the speech to be predetermined; such techniques are known as 'text-independent' techniques.

In text-dependent techniques a stored representation of the word, known as a template or model, is previously derived from a speaker known to be genuine. The input feature vectors derived from the speaker to be recognised are compared with the template, and a measure of similarity between the two is compared with a threshold for an acceptance decision. Comparison may be done by means of Dynamic Time Warping as described in "On the Evaluation of Speech Recognisers and Data Bases using a Reference System", Chollet & Gagnoulet, 1982 IEEE International Conference on Acoustics, Speech and Signal Processing, pp 2026-2029. Other means of comparison include Hidden Markov Model processing and Neural Networks. These techniques are described in British Telecom Technology Journal, Vol. 6, No. 2, April 1988: "Hidden Markov Models for Automatic Speech Recognition: Theory and Application", S J Cox, pages 105-115; "Multi-layer perceptrons applied to speech technology", McCullogh et al, pages 131-139; and "Neural arrays for speech recognition", Tattershall et al, pages 140-163.

Various types of features have been used or proposed for speech processing. In general, since the types of features used for speech recognition are intended to distinguish one word from another without sensitivity to the speaker, whereas those for speaker recognition are intended to distinguish between speakers for a known word or words, a type of feature suitable for one type of recognition may be unsuitable for the other. Some types of feature suitable for speaker recognition are described in "Automatic Recognition of Speakers from their Voices", Atal, Proc. IEEE, vol. 64, pp 460-475, April 1976.

SUMMARY OF THE INVENTION

According to the present invention there is provided a method of speaker recognition comprising the steps of receiving a speech signal from an unknown speaker; transforming the received speech signal according to a transform, the transform being associated with a particular speaker; comparing the transformed speech signal with a model representing said particular speaker; and providing as an output a parameter which depends upon the likelihood that the unknown speaker is said particular speaker.

Preferably the transforming step comprises the substeps of detecting a speech start point and a speech end point within the received speech signal; generating a sequence of feature vectors derived from the received speech signal; and aligning the sequence of feature vectors corresponding to the speech signal between the detected start point and the detected end point with a representative sequence of feature vectors for said particular speaker such that each feature vector in the aligned sequence of feature vectors corresponds to a feature vector in the representative sequence of feature vectors.

Advantageously the transforming step further comprises the substep of averaging each feature vector in the aligned sequence of feature vectors with the corresponding feature vector in the representative sequence of feature vectors.

Preferably the model is a Hidden Markov Model and may be a left to right Hidden Markov Model.

Advantageously the representative sequence of feature vectors comprises the same number of feature vectors as the number of states in the Hidden Markov Model.

According to another aspect of the present invention there is provided an apparatus for speaker recognition comprising receiving means for receiving a speech signal from an unknown speaker; a speaker transform store for storing a plurality of speaker transforms, each transform being associated with a respective one of a plurality of speakers; a speaker model store for storing a plurality of speaker models, each speaker model being associated with a respective one of said plurality of speakers; transforming means coupled to the receiving means and the speaker transform store, and arranged in operation to transform the received speech signal according to a selected speaker transform; comparing means coupled to the transforming means and the speaker model store, and arranged in operation to compare the transformed speech signal with the corresponding speaker model; and output means for providing a signal indicative of the likelihood that the unknown speaker is the speaker associated with the selected speaker transform.

Preferably the transform store stores each of said transforms as a representative sequence of feature vectors; and the transforming means comprises a start point and end point detector for detecting a speech start point and a speech end point within the received speech signal, a feature vector generator for generating a sequence of feature vectors derived from the input speech, and aligning means for aligning the sequence of feature vectors corresponding to the speech signal between the detected start point and the detected end point with a representative sequence of feature vectors such that each feature vector in the resulting aligned sequence of feature vectors corresponds to a feature vector in the representative sequence of feature vectors.

Advantageously the transforming means further comprises averaging means for averaging each feature vector in the aligned sequence of feature vectors with the corresponding feature vector in the representative sequence of feature vectors. Preferably the speaker model store is arranged to store the speaker model in the form of a Hidden Markov Model, and may be arranged to store the speaker model in the form of a left to right Hidden Markov Model.

Advantageously the stored representative sequence of feature vectors comprises the same number of vectors as the number of states in the corresponding stored Hidden Markov Model.

It is well known that a speaker's vocal tract during speech production may be modelled as a time varying filter. In this invention, prior to comparing feature vectors derived from speech with a stored reference model, the feature vectors are processed by applying a speaker dependent transform which matches the characteristics of a particular speaker's vocal tract. Features derived from speech which has very dissimilar characteristics to those of the speaker on which the transform is dependent may be severely distorted by the transform, whereas features from speech which has similar characteristics to those of the speaker on which the transform is dependent will be distorted far less. Such a speaker dependent transform may be viewed as a process similar to conventional matched filtering, in which a signal filtered using a matched filter suffers no distortion. Features which have been transformed in this way thus provide more discrimination between speakers. Such transformed features are then used in a conventional speaker recognition comparison process.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will now be described, by way of example only, with reference to the drawings, in which:

FIG. 1 shows a telecommunications system incorporating a recognition processor;

FIG. 2 shows part of the recognition processor of FIG. 1 incorporating a spectral signal extractor;

FIG. 3 shows the spectral signal extractor of FIG. 2;

FIG. 4a is a flow diagram showing the operation of the recognition processor of FIG. 1 during speaker verification;

FIG. 4b is a flow diagram showing the operation of the recognition processor of FIG. 1 during speaker identification;

FIG. 5 shows an example of a warping function between two feature vectors M and R;

FIG. 6 shows an example of a weighting function which may be applied during warping;

FIG. 7 is a flow diagram showing calculation of the time normalised distance between two feature vectors;

FIG. 8 is an example of a Markov Model;

FIG. 9 shows the transition matrix and an example of an initialisation vector for the Markov Model of FIG. 8;

FIG. 10 illustrates the computation of forward probabilities for a six state Hidden Markov Model; and

FIG. 11 illustrates a possible state sequence calculated using the Viterbi algorithm.

DETAILED DESCRIPTION OF THE INVENTION

In FIG. 1 there is shown a telecommunications system including speaker recognition apparatus comprising a microphone 1, typically forming part of a telephone handset, a telecommunications network 2 (for example a public switched telecommunications network (PSTN) or a digital telecommunications network), a recognition processor 3 connected to receive a voice signal from the network 2, and a utilising apparatus 4 connected to the recognition processor 3 and arranged to receive therefrom a voice recognition signal, indicating recognition or otherwise of a particular speaker, and to take action in response thereto. For example, the utilising apparatus 4 may be a remotely operated banking terminal for effecting banking transactions. In many cases, the utilising apparatus 4 will generate an audible response to a user, transmitted via the network 2 to a loudspeaker 5 typically forming a part of the telephone handset.

In operation, a speaker speaks into the microphone 1 and an analogue speech signal is transmitted from the microphone 1 into the network 2 to the recognition processor 3, where the speech signal is analysed and a signal indicating recognition or otherwise of a particular speaker is generated and transmitted to the utilising apparatus 4, which then takes appropriate action in the event of recognition or otherwise of a particular speaker. If the recognition processor is performing speaker identification then the signal indicates either the identified speaker or that the speaker has been rejected. If the recognition processor is performing speaker verification the signal indicates that the speaker is or is not the claimed speaker.

The recognition processor needs to acquire data concerning the identity of speakers against which to compare the speech signal. This data acquisition may be performed by the recognition processor in a second mode of operation in which the recognition processor 3 is not connected to the utilising apparatus 4, but receives a speech signal from the microphone 1 to form the recognition data for that speaker. However, other methods of acquiring the speaker recognition data are also possible; for example, speaker recognition data may be held on a card carried by the speaker and insertable into a card reader, from which the data is read and transmitted through the network to the recognition processor prior to transmission of the speech signal.

Typically, the recognition processor 3 is ignorant of the route taken by the signal from the microphone 1 to and through the network 2; the microphone 1 may, for instance, be connected through a mobile analogue or digital radio link to the network 2, or the signal may originate from another country. The microphone may be part of any one of a wide variety of types and qualities of receiver handset. Likewise, within the network 2, any one of a large variety of transmission paths may be taken, including radio links, analogue and digital paths and so on.

FIG. 2 shows part of the recognition processor 3. Digital speech is received by a spectral signal extractor 20, for example from a digital telephone network, or from an analogue to digital converter. A number of feature vectors, each of which represents a number of contiguous digital samples, are derived from the digital speech. For example, speech samples may be received at a sampling rate of 8 kHz, and a feature vector may represent a frame of 256 contiguous samples, i.e. 32 ms of speech.

The spectral signal extractor 20 provides feature vectors to an endpoint detector 24 which provides as outputs signals indicating the start point and end point of the received speech. The feature vectors are also stored in frame buffers 25 prior to processing by a speaker recognition processor 21.

The start and end points of speech may be provided using a conventional energy-based endpointer. In an improved technique, signals from a speech recogniser configured to recognise the specific word may be used instead.

A plurality of feature vectors are received by the speaker recognition processor 21, which reads a speaker dependent transform matrix associated with a particular speaker from a speaker transform store 22 and a reference model associated with the particular speaker from a speaker model store 23. The speaker recognition processor then processes the received feature vectors in dependence upon the retrieved speaker transform matrix and model, and generates an output signal in dependence upon the likelihood that the speaker represented by the retrieved model and speaker dependent transform produced the speech represented by the received feature vectors. The operation of the speaker recognition processor will be described more fully later with reference to FIG. 4a and FIG. 4b. The speaker recognition processor 21 constitutes the transforming means, the comparing means and the output means of the present invention.

Referring now to FIG. 3, the operation of the spectral signal extractor 20 will now be described in more detail. A high emphasis filter 10 receives the digitised speech waveform at, for example, a sampling rate of 8 kHz as a sequence of 8-bit numbers and performs a high emphasis filtering process (for example by executing a 1 − 0.95z⁻¹ filter), to increase the amplitude of higher frequencies.

A frame of contiguous samples of the filtered signal is windowed by a window processor 11 (i.e. the samples are multiplied by predetermined weighting constants) using, for example, a Hamming window, to reduce spurious artefacts generated by the frame's edges. In a preferred embodiment the frames overlap, for example by 50%, so as to provide, in this example, one frame every 16 ms.

Each frame of 256 windowed samples is then processed by a Mel Frequency Cepstral Coefficient (MFCC) generator 12 to extract an MFCC feature vector comprising a set of MFCCs (for example 8 coefficients).

The MFCC feature vector is derived by performing a spectral transform, e.g. a Fast Fourier Transform (FFT), on each frame of the speech signal to derive a signal spectrum; integrating the terms of the spectrum into a series of broad bands, which are distributed along the frequency axis on a 'mel-frequency' scale; taking the logarithm of the magnitude in each band; and then performing a further transform (e.g. a Discrete Cosine Transform (DCT)) to generate the MFCC coefficient set for the frame. It is found that the useful information is generally confined to the lower order coefficients. The mel-frequency scale comprises frequency bands evenly spaced on a linear frequency scale between 0 and 1 kHz, and evenly spaced on a logarithmic frequency scale above 1 kHz.
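Purely by way of illustration, the following Python/numpy sketch shows one possible realisation of such a front end using the figures quoted above (8 kHz sampling, 256-sample frames, 50% overlap, 8 cepstral coefficients). The mel mapping uses the common 2595·log₁₀(1 + f/700) approximation, and the filter-bank size and all function names are assumptions of the sketch rather than part of the described apparatus.

```python
# Illustrative sketch only: a minimal MFCC front end (pre-emphasis, Hamming
# window, FFT, mel filterbank, log, DCT) along the lines described above.
import numpy as np

def hz_to_mel(f):
    # Common approximation: roughly linear below 1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters evenly spaced on the mel scale.
    mel_points = np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def mfcc_frames(samples, fs=8000, frame_len=256, overlap=0.5,
                n_filters=20, n_ceps=8, pre_emph=0.95):
    # High emphasis (pre-emphasis) filter 1 - 0.95 z^-1, as in FIG. 3.
    emphasised = np.append(samples[0], samples[1:] - pre_emph * samples[:-1])
    step = int(frame_len * (1.0 - overlap))        # 50% overlap: one frame per 16 ms
    window = np.hamming(frame_len)
    fbank = mel_filterbank(n_filters, frame_len, fs)
    feats = []
    for start in range(0, len(emphasised) - frame_len + 1, step):
        frame = emphasised[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame, frame_len))
        energies = np.maximum(fbank @ spectrum, 1e-10)   # avoid log(0)
        log_e = np.log(energies)
        # DCT (type II): keep only the perceptually significant low-order terms.
        n = np.arange(n_filters)
        ceps = np.array([np.sum(log_e * np.cos(np.pi * k * (2 * n + 1)
                                               / (2 * n_filters)))
                         for k in range(n_ceps)])
        feats.append(ceps)
    return np.array(feats)      # one 8-coefficient feature vector per frame
```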

The high emphasis filter 10, window processor 11, MFCC generator 12, endpoint detector 24 and speaker recognition processor 21 may be provided by one or more suitably programmed digital signal processor (DSP) devices and/or microprocessors. The frame buffers 25, speaker transform store 22 and speaker model store 23 may be provided within read/write memory devices connected to such processor devices.

FIG. 4a indicates schematically the operation of the speaker recognition processor 21 during speaker verification. The speaker recognition processor receives a sequence of feature vectors at step 40, together with a detected start point and a detected end point from the endpoint detector 24. At step 41 the speaker recognition processor selects from the speaker transform store 22 a speaker dependent transform matrix for the speaker the user is claiming to be, and reads from the speaker model store 23 a corresponding model which represents the same speaker as the representative feature matrix.

The speaker dependent transform matrix represents a particular word for a particular speaker. It comprises a representative sequence of feature vectors of the represented word when uttered by the represented speaker. The speaker dependent transform matrix is also referred to herein as the sequence of representative feature vectors. The received sequence of feature vectors corresponding to the speech signal between the detected start point and the detected end point is time aligned with the speaker dependent transform matrix using the dynamic time warp (DTW) process at step 42.

The time alignment performed at step 42 will now be described in more detail with reference to FIGS. 5, 6 and 7.

The speaker dependent transform matrix comprises a representative sequence of feature vectors for a particular word:

$$M = m_1, m_2, \ldots, m_i, \ldots, m_I$$

A sequence of feature vectors

$$R = r_1, r_2, \ldots, r_j, \ldots, r_J$$

is received. The received sequence of feature vectors is time aligned with the representative sequence of feature vectors as follows.

Referring to FIG. 5, the representative sequence is represented along the i-axis and the received sequence is represented along the j-axis.

The sequence of points $c = (i, j)$ represents a "warping" function $F$ which approximately realises a mapping from the time axis of the received sequence of feature vectors onto that of the representative sequence of feature vectors:

$$F = c(1), c(2), \ldots, c(k), \ldots, c(K) \quad \text{where} \quad c(k) = (r(k), m(k))$$

As a measure of the difference between two feature vectors $m_i$ and $r_j$, a distance $d(c) = d(i, j) = \lVert m_i - r_j \rVert$ is used. The summation of the distances along the warping function,

$$\sum_{k=1}^{K} d\big(c(k)\big),$$

gives a measure of how well the warping function $F$ maps one sequence of feature vectors onto the other. This measure reaches a minimum value when $F$ is determined so as to optimally adjust for timing differences between the two sequences of feature vectors. Alternatively a weighting function may be employed, so that the weighted summation

$$\sum_{k=1}^{K} d\big(c(k)\big) \cdot Z(k)$$

is used, where $Z(k)$ weights the distance measures. One example of a weighting function is

$$Z(k) = \big(i(k) - i(k-1)\big) + \big(j(k) - j(k-1)\big),$$

which is shown diagrammatically in FIG. 6.

The time normalised distance between two sequences of vectors is defined as

$$D(M, R) = \min_{F} \left[ \frac{\sum_{k=1}^{K} d\big(c(k)\big) \cdot Z(k)}{\sum_{k=1}^{K} Z(k)} \right]$$

Various restrictions can be imposed on the warping function F, as described in "Dynamic Programming Algorithm Optimisation for Spoken Word Recognition", Sakoe and Chiba, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 26, No. 1, February 1978. The equations for calculating the time normalised distance, together with the warping function F which provides the required minimum value, are as follows:

$$g_1\big(c(1)\big) = d\big(c(1)\big) \cdot Z(1)$$

$$g_k\big(c(k)\big) = \min_{c(k-1)} \Big[ g_{k-1}\big(c(k-1)\big) + d\big(c(k)\big) \cdot Z(k) \Big],$$

which is known as the "dynamic programming" equation. The time normalised distance is

$$D(M, R) = \frac{1}{\sum_{k=1}^{K} Z(k)} \; g_K\big(c(K)\big).$$

If the weighting function shown earlier is used then the dynamic programming (DP) equation becomes

$$g(i, j) = \min \begin{cases} g(i, j-1) + d(i, j) \\ g(i-1, j-1) + 2\,d(i, j) \\ g(i-1, j) + d(i, j) \end{cases} \quad \text{and} \quad \sum_{k=1}^{K} Z(k) = I + J$$

A flow chart showing the calculation of the time normalised distance using the weighting function of FIG. 6 is shown in FIG. 7.

At step 74, i and j are initialised to 1. At step 76 the initial value of g(1, 1) is set equal to d(1, 1) = ∥m₁ − r₁∥ multiplied by 2 (according to the weighting function). Then i is increased by 1 at step 78 and, unless i is greater than I at step 80, the dynamic programming equation is calculated at step 86. If i is greater than I then j is incremented at step 88 and i is reset to 1 at step 96. Steps 78 and 86 are then repeated until eventually the dynamic programming equation has been calculated for all values of i and j, and the time normalised distance is then calculated at step 92.

In a more efficient algorithm the dynamic programming equation is only calculated for values within a restricting window of size r such that $j - r \leq i \leq j + r$.

The warping function F may then be determined by "backtracking" as follows:

$$c(K) = (I, J)$$

$$c(k-1) = \text{the } (i, j) \text{ for which } \min\big\{\, g(i, j-1),\; g(i-1, j-1),\; g(i-1, j) \,\big\} \text{ is attained.}$$

Once the warping function $F = c(1), c(2), c(3), \ldots, c(k), \ldots, c(K)$ is known, where $c(k) = (r(k), m(k))$, it is possible to determine a sequence of "time aligned" received feature vectors

$$Z = Z_1, Z_2, \ldots, Z_I.$$

In the example shown in FIG. 5:

- c(1) = (1, 1)
- c(2) = (1, 2)
- c(3) = (2, 2)
- c(4) = (3, 3)
- c(5) = (4, 3)

i.e. r₁ is mapped to m₁, r₁ is mapped to m₂, r₂ is mapped to m₂, r₃ is mapped to m₃, etc.

It can be seen that both r₁ and r₂ have been mapped onto m₂, and a decision has to be made as to which received feature vector should be used for the time aligned feature vector in this case. An alternative to choosing one of the received feature vectors is to calculate an average of the received feature vectors which map onto a single representative feature vector.

If the first such received feature vector is used, then $Z_p = r_q$ where

$$q = \min\{\, j(k) : i(k) = p \,\};$$

if the last such received feature vector is used, then $Z_p = r_s$ where

$$s = \max\{\, j(k) : i(k) = p \,\};$$

or, if an average is used,

$$Z_p = \operatorname{Ave}\{\, r_{j(k)} : i(k) = p \,\}.$$

So, in the example of FIG. 5, assuming the first such received vector is used:

- Z₁ = r₁
- Z₂ = r₂
- Z₃ = r₃
- Z₄ = r₄

etc.

It will be appreciated that such an alignment process results in an aligned sequence of feature vectors for which each feature vector in the aligned sequence of feature vectors corresponds to a feature vector in the representative sequence of feature vectors.
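As an illustrative sketch only (not the patent's implementation), the following Python/numpy fragment follows the procedure of FIGS. 5 to 7 in outline: it accumulates the weighted distances using the symmetric weighting of FIG. 6, backtracks to recover the warping path, and builds the aligned sequence using the first, last or average policy described above for received vectors that map onto the same representative vector. The function and variable names are assumptions of the sketch.

```python
# Illustrative sketch only: DTW alignment of a received feature sequence R
# onto a representative sequence M, following FIGS. 5 to 7 in outline.
import numpy as np

def dtw_align(M, R, combine="average"):
    """Return (aligned sequence Z, time normalised distance D(M, R)).

    M: (I, d) representative sequence; R: (J, d) received sequence.
    """
    M, R = np.asarray(M), np.asarray(R)
    I, J = len(M), len(R)
    g = np.full((I + 1, J + 1), np.inf)          # accumulated weighted distance
    back = np.zeros((I + 1, J + 1), dtype=int)   # 0: (i, j-1), 1: (i-1, j-1), 2: (i-1, j)
    d = lambda i, j: np.linalg.norm(M[i - 1] - R[j - 1])
    g[1, 1] = 2.0 * d(1, 1)                      # initialisation, as in FIG. 7
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            if i == 1 and j == 1:
                continue
            # symmetric weighting: horizontal/vertical steps weight 1, diagonal 2
            cands = (g[i, j - 1] + d(i, j),
                     g[i - 1, j - 1] + 2.0 * d(i, j),
                     g[i - 1, j] + d(i, j))
            back[i, j] = int(np.argmin(cands))
            g[i, j] = cands[back[i, j]]
    D = g[I, J] / (I + J)                        # sum of Z(k) is I + J
    # backtrack from (I, J) to (1, 1) to recover the warping path
    path, i, j = [], I, J
    while (i, j) != (1, 1):
        path.append((i, j))
        step = back[i, j]
        if step == 0:
            j -= 1
        elif step == 1:
            i, j = i - 1, j - 1
        else:
            i -= 1
    path.append((1, 1))
    # build the aligned sequence: for each representative index p, gather the
    # received vectors mapped onto it and take the first, last or average
    Z = []
    for p in range(1, I + 1):
        mapped = [R[jj - 1] for (ii, jj) in sorted(path) if ii == p]
        if combine == "first":
            Z.append(mapped[0])
        elif combine == "last":
            Z.append(mapped[-1])
        else:
            Z.append(np.mean(mapped, axis=0))
    return np.array(Z), D
```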

Referring again to FIG. 4a, in an improved version of the transforming process, each of the time aligned received feature vectors is also averaged with the corresponding feature vector of the speaker dependent transform matrix at the optional step 43. If the time aligned received feature vectors are substantially different from the corresponding feature vectors of the speaker dependent transform matrix then such an averaging step will severely distort the time aligned received feature vectors, whereas if the time aligned received feature vectors are similar to the speaker dependent transform matrix then the averaging process will distort the received feature vectors very little. Such transformed features should increase the discrimination in any subsequent comparison process.
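In terms of the sketch above, this optional averaging transform amounts to an element-wise average of each aligned vector with the corresponding template vector; a minimal illustration (names assumed) is:

```python
# Illustrative sketch of the optional averaging transform of step 43:
# Z_aligned (I, d) is the time aligned received sequence and M (I, d) is the
# speaker dependent transform matrix (the representative sequence).
import numpy as np

def speaker_dependent_transform(Z_aligned, M):
    # Vectors close to the template are barely changed; vectors far from the
    # template are pulled strongly towards it, i.e. heavily distorted.
    return 0.5 * (np.asarray(Z_aligned) + np.asarray(M))
```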

The transformed features are then used in a conventional speaker recognition comparison process at step 44. In this embodiment of the invention the speaker model is provided by a left to right Hidden Markov Model, and the comparison is performed using the Viterbi algorithm, as will be described shortly with reference to FIGS. 8 to 11. A distance measure indicating the likelihood that the represented speaker produced the speech represented by the received feature vectors is generated and is subsequently compared with a threshold at step 45. If the difference is less than the threshold, the speaker is accepted as corresponding to the stored template at step 47; otherwise the speaker is rejected at step 46.

The principles of modelling speech using Hidden Markov Models and Viterbi recognition will now be described with reference to FIGS. 8 to 11.

FIG. 8 shows an example HMM. The five circles 100, 102, 104, 106 and 108 represent the states of the HMM and, at a discrete time instant t, the model is considered to be in one of the states and to emit an observation O_t. In speech or speaker recognition each observation generally corresponds to a feature vector.

At instant t+1, the model either moves to a new state or stays in the same state and, in either case, emits another observation, and so on. The observation emitted depends only on the current state of the model. The state occupied at time t+1 depends only on the state occupied at time t (this property is known as the Markov property). The probabilities of moving from one state to another may be tabulated in an N×N state transition matrix A = [a_{i,j}], as shown in FIG. 9. The entry in the i-th row and the j-th column of the matrix is the probability of moving from state S_i at time t to state S_j at time t+1. As the total probability of moving out of a state is 1.0 (if the model stays in the same state then this is considered to be a transition to itself), each row of the matrix sums to 1.0. In the example shown the state transition matrix only has entries in the upper triangle, because this example is a left to right model in which no "backwards" transitions are allowed. In a more general HMM, transitions may be made from any state to any other state. Also shown is an initialisation vector (Σ) whose i-th component is the probability of occupying state S_i at time t = 1.
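As a concrete illustration (the numbers below are invented for the example and are not taken from FIG. 9), a five-state left to right model might have a transition matrix and initialisation vector of the following form, with each row of A summing to 1.0 and all initial probability on the first state:

```python
# Illustrative example of a 5-state left to right HMM transition matrix A and
# initialisation vector; the probabilities are invented for illustration only.
import numpy as np

A = np.array([
    [0.6, 0.4, 0.0, 0.0, 0.0],   # from state 1: stay, or move right
    [0.0, 0.7, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.2],
    [0.0, 0.0, 0.0, 0.0, 1.0],   # final state only transitions to itself
])
init = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # model starts in state 1

assert np.allclose(A.sum(axis=1), 1.0)       # each row sums to 1.0
```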

Assuming that W such models exist, M₁ … M_W, each representing a particular speaker, and that a speech signal from an unknown speaker is represented by a sequence of T observations O₁, O₂, O₃ … O_T, the problem is then to determine which model is most likely to have emitted this sequence of observations, i.e. to determine k where

$$P_k = \max_{i = 1, 2, 3, \ldots, W} \Pr(O \mid M_i).$$

Pr(O|M) is calculated recursively as follows.

The forward probability Δ_t(j) is defined to be the probability of a model emitting the partial observation sequence O₁, O₂, …, O_t and occupying state S_j at time t. Therefore,

$$\Pr(O \mid M) = \sum_{j=1}^{N} \Delta_T(j).$$

The probability of the model occupying state S_j at time t+1 and emitting observation O_{t+1} may be calculated from the forward probabilities at time t, the state transition probabilities a_{i,j} and the probability b_j(O_{t+1}) that state S_j emits the observation O_{t+1}, as follows:

$$\Delta_{t+1}(j) = \left[ \sum_{i=1}^{N} \Delta_t(i)\, a_{i,j} \right] b_j(O_{t+1})$$

FIG. 10 illustrates the computation of $\Delta_{t+1}(4)$ for a six-state HMM.

The recursion is initialised by setting $\Delta_1(j) = \Sigma(j)\, b_j(O_1)$.
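A minimal sketch of this forward recursion is given below (Python/numpy; names are assumptions for illustration). Here b is taken to be a precomputed T×N matrix in which b[t, j] is the probability that state j emits the observation at time t, since the description above does not prescribe how the emission probabilities are modelled.

```python
# Illustrative sketch of the forward algorithm: A is the N x N transition
# matrix, init the initialisation vector, and b[t, j] the probability that
# state j emits the observation at time t (0-based indexing, t = 0 .. T-1).
import numpy as np

def forward_probability(A, init, b):
    T, N = b.shape
    delta = np.zeros((T, N))
    delta[0] = init * b[0]                    # initialisation: Sigma(j) * b_j(O_1)
    for t in range(1, T):
        # sum over predecessor states i of delta_t(i) * a_{i,j}, then emit O_{t+1}
        delta[t] = (delta[t - 1] @ A) * b[t]
    return delta[-1].sum()                    # Pr(O | M) = sum_j delta_T(j)
```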

A computationally more efficient variant of the above algorithm is known as the Viterbi algorithm. In the Viterbi algorithm, instead of summing the forward probabilities as described, the maximum of the forward probabilities is used, i.e.

$$I_{t+1}(j) = \max_{i = 1, 2, \ldots, N} \big[ I_t(i)\, a_{i,j} \big]\, b_j(O_{t+1})$$

If it is required to recover the most likely state sequence then, each time I_t is calculated, ψ_t(j) is recorded, where ψ_t(j) is the most likely state at time t−1 given state S_j at time t, i.e. the state which maximises the right hand side of the above equation. The most likely state at time T is that state S_k for which I_T(k) is a maximum, and ψ_T(k) gives the most likely state at time T−1, and so on.
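The following sketch (same assumed conventions as the forward-algorithm sketch above) shows the Viterbi recursion together with backtracking through the recorded ψ values to recover the most likely state sequence:

```python
# Illustrative sketch of the Viterbi algorithm with backtracking; A, init and
# b follow the same conventions as in the forward-algorithm sketch above.
import numpy as np

def viterbi(A, init, b):
    T, N = b.shape
    I = np.zeros((T, N))                      # I_t(j): best path probability
    psi = np.zeros((T, N), dtype=int)         # psi_t(j): best predecessor state
    I[0] = init * b[0]
    for t in range(1, T):
        scores = I[t - 1][:, None] * A        # scores[i, j] = I_{t-1}(i) * a_{i,j}
        psi[t] = scores.argmax(axis=0)
        I[t] = scores.max(axis=0) * b[t]
    # backtrack: most likely final state, then follow psi back through time
    states = np.zeros(T, dtype=int)
    states[-1] = I[-1].argmax()
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]
    return I[-1].max(), states
```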

FIG. 11 illustrates a possible state sequence calculated using the Viterbi algorithm for an observation (or feature vector) sequence of sixteen frames and a five state left to right Hidden Markov Model.

FIG. 4b shows the corresponding operation of the speaker recognition processor 21 in speaker identification; in this case, a plurality of speaker transforms and corresponding speaker models are used. Each speaker dependent transform is selected in turn and is used to time align the received feature vectors at step 42. The time aligned sequence of received feature vectors is then compared with the corresponding speaker model at step 48. As described earlier with reference to FIG. 4a, each of the time aligned received feature vectors may also be averaged with the corresponding feature vector of the speaker dependent transform matrix at the optional step 43. The speaker is then identified as the known speaker with the distance measure indicating the greatest likelihood that the known speaker corresponds to the unknown speaker. However, if the smallest distance measure is greater than a threshold at step 53, indicating that none of the speakers has a particularly high likelihood of being the unknown speaker, then the speaker is rejected at step 54 as being unknown to the system.
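An outline of this identification loop, expressed in terms of the earlier sketches, is given below. It assumes the illustrative dtw_align, speaker_dependent_transform and viterbi helpers defined above, and an emission_probs callable per model that stands in for whatever maps transformed feature vectors to per-state emission probabilities; none of these names come from the patent itself.

```python
# Illustrative outline of FIG. 4b: try each enrolled speaker's transform and
# model in turn, and either identify the best-matching speaker or reject.
import numpy as np

def identify_speaker(received_features, speakers, reject_threshold):
    """speakers: dict name -> (transform matrix M, hmm = (A, init, emission_probs))."""
    best_name, best_distance = None, np.inf
    for name, (M, (A, init, emission_probs)) in speakers.items():
        Z, _ = dtw_align(M, received_features)             # step 42: time alignment
        Z = speaker_dependent_transform(Z, M)              # optional step 43
        likelihood, _ = viterbi(A, init, emission_probs(Z))  # step 48: comparison
        distance = -np.log(max(likelihood, 1e-300))        # smaller = more likely
        if distance < best_distance:
            best_name, best_distance = name, distance
    if best_distance > reject_threshold:                   # steps 53 and 54
        return None                                        # unknown speaker
    return best_name
```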

Historically, a DTW comparison process has worked better for speaker recognition than an HMM comparison process. One difference between comparing a sequence of feature vectors with a Hidden Markov Model and comparing the same sequence of feature vectors with a representative template using a Dynamic Time Warp algorithm is in the pattern matching stage. In a DTW approach one received feature vector may be matched to two or more representative feature vectors, corresponding to a horizontal path in FIG. 5. However, in an HMM approach each received feature vector may only be matched to one state; it is not possible to have a horizontal path in FIG. 11. Aligning the sequence of received feature vectors with the speaker dependent transform matrix allows more possibilities for mapping received feature vectors to HMM states, and hence can improve the performance of an HMM based speaker recogniser.

Another difference between an HMM based speaker recogniser and a DTW based speaker recogniser is that DTW templates are based entirely on one individual's speech, whereas a single HMM topology is often defined prior to training a set of models with an individual's speech. In an improved embodiment of the invention the speaker models are provided by HMMs which have differing numbers of states depending upon each individual's training speech. For example, the minimum number of feature vectors in a set of a particular individual's training utterances for a particular word may be used to select the number of states used for the HMM for that particular word for the particular individual. The number of feature vectors in the speaker dependent transform matrix may be similarly defined, in which case the number of feature vectors in the sequence of representative feature vectors will be the same as the number of states in the Hidden Markov Model.

The invention has been described with reference to MFCCs, but it will be appreciated that any suitable spectral representation may be used: for example, Linear Prediction Coefficient (LPC) cepstral coefficients, Fast Fourier Transform (FFT) cepstral coefficients, Line Spectral Pair (LSP) coefficients, etc.

Whilst a comparison process using Hidden Markov Models has been discussed, the invention is equally applicable to speaker recognition employing other types of comparison process, for example dynamic time warp techniques or neural network techniques.

The present invention employs a speaker dependent transform for the or each speaker to be identified. In the embodiment of the invention described here, speaker dependent transform matrices are provided by means of a representative sequence of feature vectors for each word.

Methods of deriving representative sequences of feature vectors are well known, and for understanding the present invention it is sufficient to indicate that each representative sequence of feature vectors may be formed by a process of receiving a plurality of utterances of the same word by a speaker and deriving a set of feature vectors, as described above, for each of the utterances. The sequences are then time aligned, for example as described previously, and the time aligned sequences of feature vectors for the plurality of utterances are then averaged to derive an averaged sequence of feature vectors, which provides the speaker dependent transform matrix.
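One possible way of building such a speaker dependent transform matrix from enrolment utterances is sketched below in terms of the illustrative dtw_align helper defined earlier; the choice of the first utterance as the alignment reference is an assumption of the sketch, not something prescribed by the description above.

```python
# Illustrative sketch of template (speaker dependent transform matrix)
# creation: align every enrolment utterance of the word to a reference
# utterance with dtw_align (defined above), then average the aligned sequences.
import numpy as np

def build_transform_matrix(utterance_features):
    """utterance_features: list of (frames, d) feature-vector sequences."""
    reference = np.asarray(utterance_features[0])     # assumed reference sequence
    aligned = [reference]
    for features in utterance_features[1:]:
        Z, _ = dtw_align(reference, features)         # Z has len(reference) vectors
        aligned.append(Z)
    # average frame by frame over all aligned utterances
    return np.mean(np.stack(aligned), axis=0)
```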

CLAIMS

1. A method of speaker recognition comprising the steps of receiving a speech signal from an unknown speaker; transforming the received speech signal according to a transform, the transform being associated with a particular speaker; comparing the transformed speech signal with a model representing said particular speaker; and providing as an output a parameter which depends upon the likelihood that the unknown speaker is said particular speaker; wherein the transforming step is further arranged such that a received speech signal which has dissimilar characteristics to those of the particular speaker is more distorted by the transform, whereas a received speech signal which has similar or identical characteristics to those of the particular speaker will be less distorted by the transform.

2. A method according to claim 1, in which the transforming step comprises the substeps of detecting a speech start point and a speech end point within the received speech signal; generating a sequence of feature vectors derived from the received speech signal; and aligning the sequence of feature vectors corresponding to the speech signal between the detected start point and the detected end point with a representative sequence of feature vectors for said particular speaker such that each feature vector in the aligned sequence of feature vectors corresponds to a feature vector in the representative sequence of feature vectors.

3. A method according to claim 2, in which the transforming step further comprises the substep of averaging each feature vector in the aligned sequence of feature vectors with the corresponding feature vector in the representative sequence of feature vectors.

4. A method according to claim 2, in which the model is a left to right Hidden Markov Model and in which the representative sequence of feature vectors comprises the same number of feature vectors as the number of states in the Hidden Markov Model.

5. A method according to claim 1, in which the model is a Hidden Markov Model.

6. A method according to claim 5, in which the model is a left to right Hidden Markov Model.
7. An apparatus for speaker recognition comprising receiving means for receiving a speech signal from an unknown speaker; a speaker transform store for storing a plurality of speaker transforms, each transform being associated with a respective one of a plurality of speakers; a speaker model store for storing a plurality of speaker models, each speaker model being associated with a respective one of said plurality of speakers; transforming means coupled to the receiving means and the speaker transform store, and arranged in operation to transform the received speech signal according to a selected speaker transform; comparing means coupled to the transforming means and the speaker model store, and arranged in operation to compare the transformed speech signal with the corresponding speaker model; and output means for providing a signal indicative of the likelihood that the unknown speaker is the speaker associated with the selected speaker transform; wherein each transform is further arranged such that a received speech signal which has dissimilar characteristics to those of the respective speaker to which a transform relates is more distorted by the transform, whereas a received speech signal which has similar or identical characteristics to those of the respective speaker will be less distorted by the transform.

8. An apparatus according to claim 7, in which the transform store stores each of said transforms as a representative sequence of feature vectors; and in which the transforming means comprises a start point and end point detector for detecting a speech start point and a speech end point within the received speech signal, a feature vector generator for generating a sequence of feature vectors derived from the input speech, and aligning means for aligning the sequence of feature vectors corresponding to the speech signal between the detected start point and the detected end point with a representative sequence of feature vectors such that each feature vector in the resulting aligned sequence of feature vectors corresponds to a feature vector in the representative sequence of feature vectors.

9. An apparatus according to claim 8, in which the transforming means further comprises averaging means for averaging each feature vector in the aligned sequence of feature vectors with the corresponding feature vector in the representative sequence of feature vectors.

10. An apparatus according to claim 8, in which the speaker model store is arranged to store the speaker model in the form of a left to right Hidden Markov Model, and in which the stored representative sequence of feature vectors comprises the same number of vectors as the number of states in the corresponding stored Hidden Markov Model.

11. An apparatus according to claim 7, in which the speaker model store is arranged to store the speaker model in the form of a Hidden Markov Model.

12. An apparatus according to claim 11, in which the speaker model store is arranged to store the speaker model in the form of a left to right Hidden Markov Model.
13. A method of speaker recognition comprising the steps of receiving a speech signal from an unknown speaker; transforming the received speech signal according to a transform, the transform being associated with a particular speaker; comparing the transformed speech signal with a model representing said particular speaker; and providing as an output a parameter which depends upon the likelihood that the unknown speaker is said particular speaker; the method being characterised in that the transforming step comprises the substeps of: detecting a speech start point and a speech end point within the received speech signal; generating a sequence of feature vectors derived from the received speech signal; and aligning the sequence of feature vectors corresponding to the speech signal between the detected start point and the detected end point with a representative sequence of feature vectors for said particular speaker such that each feature vector in the aligned sequence of feature vectors corresponds to a feature vector in the representative sequence of feature vectors; the aligning step further comprising the sub-step of, where a plurality of feature vectors in the aligned sequence of feature vectors correspond to a particular same feature vector in the representative sequence of feature vectors, selecting one of, or an average of, the plurality of feature vectors in the aligned sequence of feature vectors to correspond to the particular feature vector in the representative sequence of feature vectors.

14. A method according to claim 13, in which the transforming step further comprises the substep of averaging each feature vector in the aligned sequence of feature vectors with the corresponding feature vector in the representative sequence of feature vectors.

15. A method according to claim 13, in which the model is a Hidden Markov Model.

16. A method according to claim 15, in which the model is a left to right Hidden Markov Model.

17. A method according to claim 13, in which the model is a left to right Hidden Markov Model and in which the representative sequence of feature vectors comprises the same number of feature vectors as the number of states in the Hidden Markov Model.
18. An apparatus for speaker recognition comprising receiving means for receiving a speech signal from an unknown speaker; a speaker transform store for storing a plurality of speaker transforms, each transform being associated with a respective one of a plurality of speakers; a speaker model store for storing a plurality of speaker models, each speaker model being associated with a respective one of said plurality of speakers; transforming means coupled to the receiving means and the speaker transform store, and arranged in operation to transform the received speech signal according to a selected speaker transform; comparing means coupled to the transforming means and the speaker model store, and arranged in operation to compare the transformed speech signal with the corresponding speaker model; and output means for providing a signal indicative of the likelihood that the unknown speaker is the speaker associated with the selected speaker transform; wherein the transform store stores each of said transforms as a representative sequence of feature vectors; and the transforming means comprises a start point and end point detector for detecting a speech start point and a speech end point within the received speech signal, a feature vector generator for generating a sequence of feature vectors derived from the input speech, and aligning means for aligning the sequence of feature vectors corresponding to the speech signal between the detected start point and the detected end point with a representative sequence of feature vectors such that each feature vector in the resulting aligned sequence of feature vectors corresponds to a feature vector in the representative sequence of feature vectors; wherein the aligning means further comprises feature vector selecting means for, where a plurality of feature vectors in the aligned sequence of feature vectors correspond to a particular same feature vector in the representative sequence of feature vectors, selecting one of, or an average of, the plurality of feature vectors in the aligned sequence of feature vectors to correspond to the particular feature vector in the representative sequence of feature vectors.

19. An apparatus according to claim 18, in which the transforming means further comprises averaging means for averaging each feature vector in the aligned sequence of feature vectors with the corresponding feature vector in the representative sequence of feature vectors.

20. An apparatus according to claim 18, in which the speaker model store is arranged to store the speaker model in the form of a Hidden Markov Model.

21. An apparatus according to claim 20, in which the speaker model store is arranged to store the speaker model in the form of a left to right Hidden Markov Model.

22. An apparatus according to claim 18, in which the speaker model store is arranged to store the speaker model in the form of a left to right Hidden Markov Model, and in which the stored representative sequence of feature vectors comprises the same number of vectors as the number of states in the corresponding stored Hidden Markov Model.